Frank Vahid, Ann Gordon-Ross

August 2001 Proceedings of the 2001 international symposium on Low power
Full text available: pdf(230.48 KB) Additional Information: full citation, references, citings, index terms

2 Profile-based optimizations: Dynamic trace selection using performance monitoring
hardware sampling

Howard Chen, Wei-Chung Hsu, Jiwei Lu, Pen-Chung Yew, Dong-Yuan Chen 

March 2003 Proceedings of the international symposium on Code generation and
optimization: feedback-directed and runtime optimization CGO '03
Publisher: IEEE Computer Society

Full text available: pdf(1.88 MB) Additional Information: full citation, abstract, references, citings, index terms

Optimizing programs at run-time provides opportunities to apply aggressive optimizations 
to programs based on information that was not available at compile time. At run time, 
programs can be adapted to better exploit architectural features, optimize the use of 
dynamic libraries, and simplify code based on run-time constants. Our profiling system 
provides a framework for collecting information required for performing run-time 
optimization. We sample the performance hardware registers available on ... 

3 Software for high-performance systems: Identifying potential parallelism via
loop-centric profiling

Tipp Moseley, Daniel A. Connors, Dirk Grunwald, Ramesh Peri 

May 2007 Proceedings of the 4th international conference on Computing frontiers CF '07
Publisher: ACM Press

Full text available: pdf(278.02 KB) Additional Information: full citation, abstract, references, index terms

The transition to multithreaded, multi-core designs places a greater responsibility on 
programmers and software for improving performance; thread-level parallelism (TLP) will 
be increasingly relied upon in addition to instruction-level parallelism (ILP) and increased 
clock frequency. Deciding where to try to parallelize code is difficult, especially for large, 
complex applications or those where the original developers have moved on. Outer loops 
are relatively easy targets for parallelization ...

4 Extending Path Profiling across Loop Backedges and Procedure Boundaries
Sriraman Tallam, Xiangyu Zhang, Rajiv Gupta

March 2004 Proceedings of the international symposium on Code generation and
optimization: feedback-directed and runtime optimization CGO '04
Publisher: IEEE Computer Society

Full text available: pdf(416.54 KB) Additional Information: full citation, abstract, citings, index terms

Since their introduction, path profiles have been used to guide the application of
aggressive code optimizations and performing instruction scheduling. However, for
optimization and scheduling, it is often desirable to obtain frequency counts of paths that
extend across loop iterations and cross procedure boundaries. These longer paths, referred
to as interesting paths in this paper, account for over 75% of the flow in a subset of SPEC
benchmarks. Although the frequency counts of interesting paths can b ...

Keywords: path profiles, overlapping path profiles, profile-guided optimization, and
instruction scheduling



5 Microprocessor architecture: Frequent loop detection using efficient non-intrusive
on-chip hardware
Ann Gordon-Ross, Frank Vahid

October 2003 Proceedings of the 2003 international conference on Compilers,
architecture and synthesis for embedded systems CASES '03
Publisher: ACM Press

Full text available: pdf(278.07 KB) Additional Information: full citation, abstract, references, citings, index terms

Dynamic software optimization methods are becoming increasingly popular for improving 
software performance and power. The first step in dynamic optimization consists of 
detecting frequently executed code, or "critical regions." Previous critical region detectors 
have been targeted to desktop processors. We introduce a critical region detector 
targeted to embedded processors, with the unique features of being very size and power 
efficient, and being completely non-intrusive to the software's exec ... 

Keywords: dynamic optimization, frequent loop detection, frequent value profiling, 
hardware profiling, hot spot detection, on-chip profiling, runtime profiling 



6 Continuous program optimization: A case study
Thomas Kistler, Michael Franz

July 2003 ACM Transactions on Programming Languages and Systems (TOPLAS),
Volume 25 Issue 4
Publisher: ACM Press

Full text available: pdf(877.67 KB) Additional Information: full citation, abstract, references, citings, index terms, review

Much of the software in everyday operation is not making optimal use of the hardware on 
which it actually runs. Among the reasons for this discrepancy are hardware/software 
mismatches, modularization overheads introduced by software engineering 
considerations, and the inability of systems to adapt to users' behaviors. A solution to
these problems is to delay code generation until load time. This is the earliest point at 
which a piece of software can be fine-tuned to the actual capabilities of the ... 

Keywords: Dynamic code generation, continuous program optimization, dynamic

7 Practicing JUDO: Java under dynamic optimizations
Michał Cierniak, Guei-Yuan Lueh, James M. Stichnoth
Full text available: pdf(190.06 KB) Additional Information: full citation, abstract, references, citings, index terms

A high-performance implementation of a Java Virtual Machine (JVM) consists of efficient 
implementation of Just-In-Time (JIT) compilation, exception handling, synchronization 
mechanism, and garbage collection (GC). These components are tightly coupled to 
achieve high performance. In this paper, we present some static and dynamic techniques
implemented in the JIT compilation and exception handling of the Microprocessor 
Research Lab Virtual Machine (MRL VM), ... 



8 Optimally profiling and tracing programs
Thomas Ball, James R. Larus 

July 1994 ACM Transactions on Programming Languages and Systems (TOPLAS), 

Volume 16 Issue 4 
Publisher: ACM Press 

Full text available: pdf(2.84 MB) Additional Information: full citation, abstract, references, citings, index terms

This paper describes algorithms for inserting monitoring code to profile and trace
programs. These algorithms greatly reduce the cost of measuring programs with respect 
to the commonly used technique of placing code in each basic block. Program profiling 
counts the number of times each basic block in a program executes. Instruction tracing 
records the sequence of basic blocks traversed in a program execution. The algorithms 
optimize the placement of counting/tracing code with respect to the ... 

Keywords: control-flow graph, instruction tracing, instrumentation, profiling 



9 High-level power estimation (invited talks): Source code optimization and profiling of 
energy consumption in embedded systems 

Tajana Simunic, Luca Benini, Giovanni De Micheli, Mat Hans 

September 2000 Proceedings of the 13th international symposium on System 
synthesis ISSS '00
Publisher: IEEE Computer Society

Full text available: pdf(81.88 KB) Additional Information: full citation, abstract, references, citings

This paper presents a source code optimization methodology and a profiling tool that have 
been developed to help designers in optimizing software performance and energy in 
embedded systems. Code optimizations are applied at three levels of abstraction: 
algorithmic, data and instruction-level. The profiler exploits a cycle-accurate energy 
consumption simulator [3] to relate the embedded system energy consumption and 
performance to the source code. Thus, it can be used for analysis (i.e., to find ... 

10 Performance of Runtime Optimization on BLAST 

Abhinav Das, Jiwei Lu, Howard Chen, Jinpyo Kim, Pen-Chung Yew, Wei-Chung Hsu, Dong- 
Yuan Chen 

March 2005 Proceedings of the international symposium on Code generation and 
optimization CGO '05 

Publisher: IEEE Computer Society 

Full text available: pdf(218.95 KB) Additional Information: full citation, abstract, index terms

Optimization of a real world application BLAST is used to demonstrate the limitations of 
static and profile-guided optimizations and to highlight the potential of runtime 
optimization systems. We analyze the performance profile of this application to determine 
performance bottlenecks and evaluate the effect of aggressive compiler optimizations on 
BLAST. We find that applying common optimizations (e.g. O3) can degrade performance.
Profile guided optimizations do not show much improvement across t ... 



11 Profiling tools for hardware/software partitioning of embedded applications 
Dinesh C. Suresh, Walid A. Najjar, Frank Vahid, Jason R. Villarreal, Greg Stitt
June 2003 ACM SIGPLAN Notices, Proceedings of the 2003 ACM SIGPLAN conference
on Language, compiler, and tool for embedded systems LCTES '03, volume
38 Issue 7
Publisher: ACM Press

Full text available: pdf(228.38 KB) Additional Information: full citation, abstract, references, citings, index terms



Loops constitute the most executed segments of programs and therefore are the best 
candidates for hardware software partitioning. We present a set of profiling tools that are 
specifically dedicated to loop profiling and do support combined function and loop 
profiling. One tool relies on an instruction set simulator and can therefore be augmented 
with architecture and micro-architecture features simulation while the other is based on 
compile-time instrumentation of gcc and therefore has very litt ... 

Keywords: compiler optimization, hardware/software partitioning, loop analysis



12 1 - Special Section: Link-time compaction and optimization of ARM executables 
Bjorn De Sutter, Ludo Van Put, Dominique Chanet, Bruno De Bus, Koen De Bosschere 
February 2007 ACM Transactions on Embedded Computing Systems (TECS), Volume 6 
Issue 1 

Publisher: ACM Press 

Full text available: pdf(636.53 KB) Additional Information: full citation, abstract, references, index terms

The overhead in terms of code size, power consumption, and execution time caused by 
the use of precompiled libraries and separate compilation is often unacceptable in the 
embedded world, where real-time constraints, battery life-time, and production costs are 
of critical importance. In this paper, we present our link-time optimizer for the ARM 
architecture. We discuss how we can deal with the peculiarities of the ARM architecture 
related to its visible program counter and how the introduced over ... 

Keywords: Performance, compaction, linker, optimization 




13 Predicting program behavior using real or estimated profiles 
David W. Wall

May 1991 ACM SIGPLAN Notices, Proceedings of the ACM SIGPLAN 1991 conference
on Programming language design and implementation PLDI '91, volume 26
Publisher: ACM Press

Full text available: pdf(977.81 KB) Additional Information: full citation, references, citings, index terms



14 Compilation and dynamic optimization: Dynamic parallelization and mapping of
binary executables on hierarchical platforms
Efe Yardimci, Michael Franz

May 2006 Proceedings of the 3rd conference on Computing frontiers CF '06
Publisher: ACM Press

Full text available: pdf(241.09 KB) Additional Information: full citation, abstract, references, index terms

As performance improvements are being increasingly sought via coarse-grained 
parallelism, established expectations of continued sequential performance increases are 
not being met. Current trends in computing point towards platforms seeking performance 
improvements through various degrees of parallelism, with coarse-grained parallelism 
features becoming commonplace in even entry-level systems. Yet the broad variety of 
multiprocessor configurations that will be available that differ in the number o ... 

Keywords: continuous optimization, dynamic parallelization 



15 The Accuracy of Initial Prediction in Two-Phase Dynamic Binary Translators 
Youfeng Wu, Mauricio Breternitz, Justin Quek, Orna Etzion, Jesse Fang 
March 2004 Proceedings of the international symposium on Code generation and
optimization: feedback-directed and runtime optimization CGO '04
Publisher: IEEE Computer Society

Full text available: pdf(234.04 KB) Additional Information: full citation, abstract, citings, index terms

Dynamic binary translators use a two-phase approach to identify and optimize frequently
executed code dynamically. In the first step (profiling phase), blocks of code are
interpreted or quickly translated to collect execution frequency information for the blocks.
In the second phase (optimization phase), frequently executed blocks are grouped into
regions and advanced optimizations are applied on them. This approach implicitly assumes
that the initial profile of each block is representative of the block t ...

16 Using branch handling hardware to support profile-driven optimization 
Thomas M. Conte, Burzin A. Patel, J. Stan Cox

November 1994 Proceedings of the 27th annual international symposium on
Microarchitecture MICRO 27
Publisher: ACM Press

Full text available: pdf(954.48 KB) Additional Information: full citation, abstract, references, citings, index terms

Profile-based optimizations can be used for instruction scheduling, loop scheduling, data 
preloading, function in-lining, and instruction cache performance enhancement. However, 
these techniques have not been embraced by software vendors because programs 
instrumented for profiling run 2-30 times slower, an awkward compile-run-recompile 
sequence is required, and a test input suite must be collected and validated for each 
program. This paper proposes using existing bran ... 

17 Toward a methodology of optimizing programs for high-performance computers
Rudolf Eigenmann

August 1993 Proceedings of the 7th international conference on Supercomputing ICS '93
Publisher: ACM Press

Full text available: pdf(1.20 MB) Additional Information: full citation, abstract, references, citings, index terms

This report describes experiences in porting the Perfect Benchmarks programs to the 
Alliant FX/8 and Cedar machines. Although there is much to learn before we can really 
say we know how to optimize programs successfully for parallel computers in general and 
for hierarchical shared-memory architectures in particular, it is hoped that this report will 
be helpful to those who are beginning to port and transform programs for parallel 
machines. Perhaps it is even an interesting reference for exp ... 

18 Handling irreducible loops: optimized node splitting versus DJ-graphs 
Sebastian Unger, Frank Mueller

July 2002 ACM Transactions on Programming Languages and Systems (TOPLAS),
Volume 24 Issue 4
Publisher: ACM Press

Full text available: pdf(386.11 KB) Additional Information: full citation, abstract, references, citings, index terms

This paper addresses the question of how to handle irreducible regions during 
optimization, which has become even more relevant for contemporary processors since 
recent VLIW-like architectures highly rely on instruction scheduling. The contributions of 
this paper are twofold. First, a method of optimized node splitting to transform irreducible 
regions of control flow into reducible regions is formally defined and its correctness is 
shown. This method is superior to approaches previously publishe ... 

Keywords: Code optimization, compilation, control flow graphs, instruction-level 
parallelism, irreducible flowgraphs, loops, node splitting, reducible flowgraphs 



19 Interaction cost and shotgun profiling 

Brian A. Fields, Rastislav Bodik, Mark D. Hill, Chris J. Newburn 

September 2004 ACM Transactions on Architecture and Code Optimization (TACO),
Volume 1 Issue 3
Publisher: ACM Press

Full text available: pdf(647.17 KB) Additional Information: full citation, abstract, references, citings, index terms

We observe that the challenges software optimizers and microarchitects face every day
boil down to a single problem: bottleneck analysis. A bottleneck is any event or resource
that contributes to execution time, such as a critical cache miss or window stall. Tasks
such as tuning processors for energy efficiency and finding the right loads to prefetch all
require measuring the performance costs of bottlenecks. In the past, simple event counts
were enough to find the important bottlenecks. Today, t ...

Keywords: Performance analysis, critical path, modeling, profiling 



20 A study of source-level compiler algorithms for automatic construction of
pre-execution code
Dongkeun Kim, Donald Yeung

August 2004 ACM Transactions on Computer Systems (TOCS), volume 22 issue 3
Publisher: ACM Press

Full text available: pdf(1.55 MB) Additional Information: full citation, abstract, references, citings, index terms

Pre-execution is a promising latency tolerance technique that uses one or more helper 
threads running in spare hardware contexts ahead of the main computation to trigger 
long-latency memory operations early, hence absorbing their latency on behalf of the 
main computation. This article investigates several source-to-source C compilers for 
extracting pre-execution thread code automatically, thus relieving the programmer or 
hardware from this onerous task. We present an aggressive profile-driven co ... 

Keywords: Data prefetching, memory-level parallelism, multithreading, pre-execution, 
prefetch conversion, program slicing, speculative loop parallelization

Abstract

Traditional software controlled data cache prefetching is often ineffective due to the lack o 
and miss address information. To overcome this limitation, we implement runtime data cac 
dynamic optimization system ADORE (ADaptive Object code Reoptimization). Its perform; 
compared with static software prefetching on the SPEC2000 benchmark suite. Runtime ce 
shows better performance. On an Itanium 2 based Linux workstation, it can increase perfc 
than 20% over static prefetching on some benchmarks. For benchmarks that do not benel 
the runtime optimization system adds only 1%-2% overhead. We have also collected each 
guide static data cache prefetching in the ORC compiler. With that information the compile 
avoid generating prefetches for loops that hit well in the data cache.